library(tidyverse)
df <- read_csv("spotify-2017.csv",
col_types = cols(mode = col_character()))
df <- df %>% mutate(mode = fct_recode(mode,
"Major" = "1.0",
"Minor" = "0.0"))
df <- df %>% select(name, artists, loudness, mode, valence, tempo)
Download the Spotify data and Spotify starter R script from the course website and save them in the folder that you have been using for this class.
Open the R file and set R’s working directory to the folder that this file is in. (You can do that through the menu bar at the top of the screen: Session > Set Working Directory > To Source File Location.) Run the code in the R script and try to understand what the code is doing.
Each song comes in a particular key, and keys are grouped into “Major” keys and “Minor” keys. Major keys are generally happy-sounding, while minor keys sound more melancholic. Let’s say we want to test if the difference in tempo (speed of the song) between songs in major and minor keys. One way to do this is to test if the mean tempo of songs in major keys is different from that for songs in minor keys.
ggplot(df, aes(mode, tempo, fill = mode))+
geom_boxplot()
The code below computes the tempo means for each mode:
df %>% group_by(mode) %>%
summarize(mean_tempo = mean(tempo))
## # A tibble: 2 x 2
## mode mean_tempo
## <fct> <dbl>
## 1 Minor 116.
## 2 Major 122.
We see that songs in minor key have mean tempo of ~116 bpm, while songs in major key have a slightly faster mean tempo of ~122 bpm. Is this difference significant? To test for significance, one option we have is to use the two-sided \(t\)-test. (Whether the \(t\)-test is appropriate is a question left for a statistical methods class.)
The code for performing a two-sided \(t\)-test is below:
major_data <- (df %>% filter(mode == "Major"))$tempo
minor_data <- (df %>% filter(mode == "Minor"))$tempo
t.test(major_data, minor_data, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: major_data and minor_data
## t = 1.0377, df = 97.475, p-value = 0.302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.152959 16.446581
## sample estimates:
## mean of x mean of y
## 121.5741 115.9273
From the readout, the \(p\)-value for this test is around 0.30, which is fairly large. We wouldn’t reject the null hypothesis in favor of the alternative hypothesis.
The readout also gives us a confidence interval: this is an interval which will capture the true value of the difference 95% of the time (if we were to repeat this procedure over and over again).
The \(t\)-test makes some strong assumptions on how the data was generated. (We call it a parametric test). If we don’t want to make assumptions on how the data was generated, we can use a non-parametric test, such as the Kolmogorov-Smirnov test, which tests whether the distribution of 2 variables is the same or not:
ks.test(major_data, minor_data, alternative = "two.sided")
## Warning in ks.test(major_data, minor_data, alternative = "two.sided"): cannot
## compute exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: major_data and minor_data
## D = 0.13136, p-value = 0.7946
## alternative hypothesis: two-sided
We can start with checking the correlation between valence and loudness. Correlation is a measure between -1 and 1. Values closed to -1 mean that there is negative association between the variables, values closed to 1 imply positive dependence. We observe some positive correlation: the higher the valence the higher the loudness.
df %>% summarize(cor(valence, loudness))
## # A tibble: 1 x 1
## `cor(valence, loudness)`
## <dbl>
## 1 0.408
Next, let’s perform linear regression on valence vs. loudness. We expect increasing loudness to result in increasing valence. Linear regression is easily done in R with the lm function:
lm(valence ~ loudness, data = df)
##
## Call:
## lm(formula = valence ~ loudness, data = df)
##
## Coefficients:
## (Intercept) loudness
## 0.79386 0.04897
The line above finds the coefficients \(a\) and \(b\) such that the line \(valence = a + b \cdot loudness\) fits the data best. From the output, we can see that the line of best fit is \(valence = 0.79386 + 0.04897 \cdot loudness\). The coefficients mean that for every unit increase in the loudness scale, valence increases correspondingly by 0.04897. This matches our expectations.
The output of the lm function doesn’t give us any information other than the coefficients. We can use the summary function to get more information:
fit <- lm(valence ~ loudness, data = df)
summary(fit)
##
## Call:
## lm(formula = valence ~ loudness, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39188 -0.12018 -0.00328 0.14860 0.40816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79386 0.06570 12.08 < 2e-16 ***
## loudness 0.04897 0.01108 4.42 2.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1986 on 98 degrees of freedom
## Multiple R-squared: 0.1662, Adjusted R-squared: 0.1577
## F-statistic: 19.54 on 1 and 98 DF, p-value: 2.548e-05
Residuals refer to the differences between the actual score, and the score predicted by the linear regression. This is something that we look at to check if the linear regression model makes sense.
Instead of just the value of the coefficients, we have more information in a table. Look at the p-values in the last column. The p-value for loudness is the result of testing the null hypothesis coefficient for loudness \(= 0\) (i.e. loudness is uncorrelated with valence) vs. the alternative hypothesis coefficient for loudness \(\neq 0\) (i.e. loudness is correlated with valence). The p-value is very small, and so the data is not consistent with the null hypothesis (i.e. no relationship). We would conclude that there is a statistically signifcant relationship between loudness and valence.
We can plot the linear fit on a scatterplot using geom_smooth and specify the method as “lm”:
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
The default method is not linear, but also represents the trend.
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It is likely that valence is influenced by more than 1 variable. We can fit more separate linear models for different mode values. We observa that slope is a bit different for major and minor modes.
ggplot(data = df, mapping = aes(x = loudness, y = valence, color = mode)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
You can use facets instead.
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
geom_point() +
facet_wrap(~mode) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
How can we get predictions from the model? We can use the predict function, with the first argument being the model (i.e. output of lm), and the second argument being the data on which to predict. (The data must be in the exact same form as the data the model was trained on, columns and all.)
This code gives the predictions on the training dataset:
fit <- lm(valence ~ loudness, data = df)
predict(fit, df)
## 1 2 3 4 5 6 7 8
## 0.6379884 0.5819175 0.5609092 0.4623810 0.5840722 0.4588062 0.4708529 0.5469037
## 9 10 11 12 13 14 15 16
## 0.5509193 0.3837838 0.4821161 0.4790799 0.5478342 0.5768246 0.3631673 0.5874511
## 17 18 19 20 21 22 23 24
## 0.6047376 0.5554735 0.5946497 0.5830438 0.5531719 0.5613010 0.4315787 0.5962658
## 25 26 27 28 29 30 31 32
## 0.4840259 0.4577289 0.6286351 0.4125783 0.5196763 0.5695280 0.5029774 0.4496977
## 33 34 35 36 37 38 39 40
## 0.6423468 0.6323079 0.4874048 0.4151737 0.2367261 0.4709508 0.5198721 0.5085110
## 41 42 43 44 45 46 47 48
## 0.4847605 0.5946497 0.6418081 0.5877449 0.5733966 0.4635563 0.2325636 0.6508186
## 49 50 51 52 53 54 55 56
## 0.5336818 0.5821623 0.6765280 0.3319243 0.5338776 0.5489115 0.5056218 0.6324058
## 57 58 59 60 61 62 63 64
## 0.5656104 0.4548396 0.2845210 0.5841701 0.4626749 0.5228104 0.4973948 0.5312332
## 65 66 67 68 69 70 71 72
## 0.5272177 0.4919101 0.3236972 0.4668373 0.6380864 0.5148282 0.5671284 0.6288310
## 73 74 75 76 77 78 79 80
## 0.4828506 0.4213440 0.4859357 0.6423957 0.3884359 0.6217303 0.4815284 0.5428392
## 81 82 83 84 85 86 87 88
## 0.5881857 0.6433751 0.5726131 0.5057687 0.5280012 0.4206584 0.4884332 0.5445042
## 89 90 91 92 93 94 95 96
## 0.4801572 0.5442104 0.4097380 0.5934255 0.5237408 0.5318699 0.3909334 0.5607133
## 97 98 99 100
## 0.5171298 0.4400016 0.5538085 0.4709998
Let’s plot these predictions on the scatterplot to make sure we got it right:
df$predictions <- predict(fit, df)
ggplot(data = df) +
geom_point(aes(x = loudness, y = valence)) +
geom_smooth(aes(x = loudness, y = valence), method = "lm", se = FALSE) +
geom_point(aes(x = loudness, y = predictions), col = "red")
## `geom_smooth()` using formula 'y ~ x'
Here are the predictionrs on a randomly generated set of data:
new_df <- data.frame(loudness = c(-10, -3.6, -3.74, -8, -5.22))
new_df <- new_df %>% mutate(valence = predict(fit, new_df))
new_df
## loudness valence
## 1 -10.00 0.3041581
## 2 -3.60 0.6175678
## 3 -3.74 0.6107120
## 4 -8.00 0.4020986
## 5 -5.22 0.5382360
ggplot(data = df, aes(x = loudness, y = valence)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_point(data = new_df, col = "red")
## `geom_smooth()` using formula 'y ~ x'
If you plot the output of the lm call, you will get a series of plots which give you more information on the fit:
fit <- lm(valence ~ loudness, data = df)
par(mfrow = c(2,2))
plot(fit)
The following code gives the mean tempo for all the songs:
mean(df$tempo)
## [1] 119.2025
We can use the following code to get an \(x\)% confidence interval for the mean tempo:
x <- 0.9
t.test(df$tempo, conf.level = x)
##
## One Sample t-test
##
## data: df$tempo
## t = 42.644, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 114.5612 123.8437
## sample estimates:
## mean of x
## 119.2025
To get confidence intervals for parameters in a linear model, we can use the confint function (if level = x in the confint call, it will give the endpoints of the x% confidence interval):
x <- 0.95
fit <- lm(valence ~ loudness, data = df)
confint(fit, level = x)
## 2.5 % 97.5 %
## (Intercept) 0.66349030 0.92423126
## loudness 0.02698615 0.07095438
It might be informative to have the song names in the scatterplot, especially when we want to identify outliers. The following plot is pretty crowded, so putting all the song names might not be a good idea. In any case, here is how you can do it (play around with the different options in geom_text to see what they do):
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
geom_point() +
geom_text(aes(label = name, col = mode),
hjust = 0, nudge_x = 0.1, angle = 45, size = 3) +
facet_grid(. ~ mode)
It’s a little bit of a mess. We can do slightly better by loading the ggrepel package, and replacing geom_text with geom_text_repel (and some change of options):
library(ggrepel)
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
geom_point() +
geom_text_repel(aes(label = name, col = mode), size = 3) +
facet_grid(. ~ mode)
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
plotly for interactive figuresThe plotly package is great for making interactive figures with simple syntax. For example, here is the plot of valence vs. loudness, with the color denoting the mode of the song:
library(plotly)
plot_ly(data = df, x = ~loudness, y = ~valence, color = ~mode)
If you hover the cursor over each observation, you will see that it gives the values of loudness, valence and mode for this point. The code below changes the mouseover information to the song title:
plot_ly(data = df, x = ~loudness, y = ~valence, color = ~mode,
hoverinfo = "text", text = ~name)
(To keep the original information that the data labels had, remove the hoverinfo option from the function call above.) More documentation on the plotly package can be found at https://plot.ly/r/.